System and method for content-based partitioning and mining

ABSTRACT

Methods and systems are provided for partitioning data of a database or data store into several independent parts as part of a data mining process. The methods and systems use a mining application having content-based partitioning logic to partition the data. Once the data is partitioned, the partitioned data may be grouped and distributed to an associated processor for further processing. The mining application and content-based partitioning logic may be used in a computing system, including shared memory and distributed memory multi-processor computing systems. Other embodiments are described and claimed.

BACKGROUND

Given modern computing capabilities, it is relatively easy to collect and store vast amounts of data, such as facts, numbers, text, etc. The issue then becomes how to analyze the vast amount of data to determine important data from less important data. The process of filtering the data to determine important data is often referred to as data mining. Data mining refers to a process of collecting data and analyzing the collected data from various perspectives, and summarizing any relevant findings. Locating frequent itemsets in a transaction database has become an important consideration when mining data. For example, frequent itemset mining has been used to locate useful patterns in a customer's transaction database.

Frequent Itemset Mining (FIM) is the basis of Association Rule Mining (ARM), and has been widely applied in marketing data analysis, protein sequences, web logs, text, music, stock market, etc. One popular algorithm for frequent itemset mining is the frequent pattern growth (FP-growth) algorithm. The FP-growth algorithm is used for mining frequent itemsets in a transaction database. The FP-growth algorithm uses a prefix tree (termed the “FP-tree”) representation of the transaction database, and is faster than other mining algorithms, such as the Apriori mining algorithm. The FP-growth algorithm is often described as a recursive elimination scheme.

As part of a preprocessing step, the FP-growth algorithm deletes all items from the transactions that are not individually frequent according to a defined threshold. That is, the FP-growth algorithm deletes all items that do not appear in a user-specified minimum number of transactions. After preprocessing, an FP-tree is built, then the FP-growth algorithm constructs a “conditional pattern base” for each frequent item to construct a conditional FP-tree. The FP-growth algorithm then recursively mines the conditional FP-tree. The pattern growth is achieved by the concatenation of the suffix pattern with the frequent patterns generated from the conditional FP-tree.
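The preprocessing step described above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `prune_infrequent` and the toy transactions are hypothetical and are not taken from the text.

```python
from collections import Counter

def prune_infrequent(transactions, min_support):
    """Drop every item that appears in fewer than min_support transactions."""
    # Count each item at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, n in counts.items() if n >= min_support}
    # Keep the original item order inside each transaction.
    return [[item for item in t if item in frequent] for t in transactions]

# Hypothetical toy data: only A and B reach a support of 2.
toy = [["A", "B", "C"], ["A", "B"], ["B", "D"]]
pruned = prune_infrequent(toy, 2)
```

Items C and D each occur once in the toy data, so they are removed before any tree is built.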

Since the FP-growth algorithm has been recognized as a powerful tool for frequent itemset mining, there has been a large amount of research in efforts to implement the FP-growth algorithm in parallel processing computers. There have been two main approaches to implement FP-growth: the multiple tree approach and the single tree approach. The multiple tree approach builds multiple FP-trees separately, which results in the introduction of many redundant nodes. FIG. 6 illustrates the multiple nodes generated by the conventional multiple tree approach with 1, 4, 8, 16, 32 and 64 threads (trees). The example database used to generate FIG. 6 is a benchmark dataset “accidents”, which can be found at the link “http://fimi.cs.helsinki.fi/data/” (the minimal support threshold is 200,000). As shown, the multiple tree approach will generate two (2) times as many tree nodes on four (4) threads, and about nine (9) times as many tree nodes on sixty-four (64) threads, as compared to only one thread. The shortcoming of building redundant nodes in multiple trees results in great memory demand, and sometimes the memory is not large enough to contain the multiple trees. The previous single tree approach builds only a single FP-tree in memory, but it needs to generate one lock that is associated with each of the tree nodes, thereby limiting scalability.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a computing device including a mining application having content-based partitioning functionality, according to an embodiment.

FIGS. 2A-2B depict a flow diagram which illustrates various logical operations by content-based partition logic, according to an embodiment.

FIG. 3 represents a probe tree data structure, under an embodiment.

FIGS. 4A-4D illustrate the first steps in a parallel build of a frequent pattern (FP) tree, according to an embodiment.

FIG. 5 depicts partitioned sub-branches and a grouping result for a probe tree, according to an embodiment.

FIG. 6 illustrates the multiple nodes generated by a conventional multiple tree approach with 1, 4, 8, 16, 32 and 64 threads (trees).

DETAILED DESCRIPTION

Embodiments of methods and systems provide for partitioning data of a database or data store into several independent parts as part of a data mining process. The methods and systems use content-based partitioning logic to partition the data by building a probe structure. In an embodiment, the probe structure includes a probe tree and associated probe table. Once the data is partitioned, according to content for example, the resulting parts may be grouped and distributed to an associated processor for further processing. The content-based partitioning logic may be used in a computing system, including shared memory and distributed memory multi-processor computing systems, but is not so limited.

As described above, after the data is partitioned into independent parts or branches (e.g. disjoint sub-branches), each branch or group of branches may then be assigned to a processor (or thread) for further processing. In an embodiment, after scheduling by a master thread, each transaction is distributed to a corresponding processor to finish building a specific branch of the FP-tree. According to the FP-growth algorithm, the data dependency only exists within each sub-branch, and there is no dependency between the sub-branches. Thus many of the locks required in conventional single tree methodologies are unnecessary. Under an embodiment, a database is partitioned based on the content of transactions in a transaction database. The content-based partitioning saves memory, and eliminates many locks when building the FP-tree. The memory savings are based on a number of factors, including the characteristics of the database, the support threshold, and the number of processors. The techniques described herein provide an efficient and valuable tool for data mining using multi-core and other processing architectures.

In an embodiment, a master/slave processing architecture is used to efficiently allocate processing operations to a number of processors (also referred to as “threads”). The master/slave architecture includes a master processor (master thread), and any remaining processors (threads) are designated as slave processors (slave threads) when building the FP-tree. One of the tasks of the master thread is to load a transaction from a database, prune the transaction, and distribute the pruned transaction to each of the slave threads for a parallel build of the FP-tree. Each slave thread has its own transaction queue and obtains a pruned transaction from the queue each time the master thread assigns a pruned transaction to an associated slave thread. Each slave thread processes its pruned transaction to build the FP-tree. This architecture allows the master thread to perform some preliminary measurements on the transaction before delivering it to a slave thread. According to the results of the preliminary measurements, the master thread may also decide how to distribute the transaction to a particular thread. In alternative embodiments, all threads operate independently to build the FP-tree using the probe structure.

In the following description, numerous specific details are introduced to provide a thorough understanding of, and enabling description for, embodiments of the systems and methods. One skilled in the relevant art, however, will recognize that these embodiments may be practiced without one or more of the specific details, or with other components, systems, etc. In other instances, well-known structures or operations are not shown, or are not described in detail, to avoid obscuring aspects of the disclosed embodiments.

FIG. 1 illustrates a computing device 100 including a mining application 101 having content-based partition logic 102 which interacts with a data store (or database) 104 when performing data mining operations, according to an embodiment. The mining application 101 and content-based partition logic 102 are described below in detail. The computing device 100 includes any computing system, such as a handheld, mobile computing system, a desktop computing system, laptop computing system, and other computing systems. The computing device 100 shown in FIG. 1 is referred to as a multi-processor or multi-core device since the architecture includes multiple processors 106 a-106 x. Tasks may be efficiently distributed among the processors 106 a-106 x so as not to overwhelm an individual processor. In other embodiments, the computing device 100 may include a single processor and other components. The computing device 100 typically includes system memory 108.

Depending on the configuration and type of computing device, system memory 108 may be volatile (such as random-access memory (RAM) or other dynamic storage), non-volatile (such as read-only memory (ROM), flash memory, etc.), or some combination.

The system memory 108 may include an operating system 110 and one or more applications/modules 112. Computing device 100 may include additional computer storage 114, such as magnetic storage devices, optical storage devices, etc. Computer storage includes, but is not limited to, RAM, ROM, electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disk ROM (CD-ROM), digital versatile disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which may be used to store information. Computing device 100 may also include one or more input devices 116 and one or more output devices 118. Computing device 100 may also contain communication connections 120 that allow the computing device 100 to communicate with other computing devices 122, processors, and/or systems, such as over a wired and/or wireless network, or other network.

FIGS. 2A-2B depict a flow diagram which illustrates various logical operations of the mining application 101 including the content-based partition logic 102, according to an embodiment. For this embodiment, processor 106 a (referred to herein as the master processor) executes the mining application 101 including the content-based partition logic 102, as described below. The transaction database of Table 1 below is used in conjunction with the flowchart to provide an illustrative example using the mining application 101 and content-based partition logic 102. For example, each transaction in Table 1 may represent a sequence of items, such as items purchased as part of a web-based transaction, wherein each item of the transaction is represented by a unique letter.

TABLE 1

Transaction Number    Transaction
1                     A, B, C, D, E
2                     F, B, D, E, G
3                     A, B, F, G, D
4                     B, D, A, E, G
5                     B, F, D, G, K
6                     A, B, F, G, D
7                     A, R, M, K, O
8                     B, F, G, A, D
9                     A, B, F, M, O
10                    A, B, D, E, G
11                    B, C, D, E, F
12                    A, B, D, E, G
13                    A, B, F, G, D
14                    B, F, D, G, R
15                    A, B, D, F, G
16                    A, R, M, K, J
17                    B, F, G, A, D
18                    A, B, F, M, O

At 200, the mining application 101 uses the content-based partitioning logic 102 and begins by scanning the first half of a transaction database, such as the transaction database in Table 1, to determine frequent items based on a defined support threshold. Different support thresholds may be implemented depending on a particular application. For this example, the support threshold is 12. At 202, each transaction of the first half of the transaction database is scanned and the number of times that each item occurs in the scan is counted, under an embodiment. Since the search covers half of the transaction database, a support threshold of six (6) is used to create the header table for the FP-tree based on the scan of the first half of the database. At 204, it is determined if the first half of the database has been scanned. If not, the flow returns to 200, and the next transaction is read. If the first half of the database has been scanned at 204, then the logic 102, at 206, creates a header table for the frequent items that meet the minimum support threshold based on the scan of the first half of the transaction database.

The logic 102 also assigns an ordering to the frequent items identified in the first scan. In an embodiment, the logic 102 orders the frequent items according to an occurrence frequency in the transaction database. For this example, the logic 102 orders the most frequent items as B then A then D then F then G (B≧A≧D≧F≧G). This ordering is used to create the header table and probe structure, described below. For this example, scanning the first half of the transaction database has resulted in the identification of various items which are inserted into the header table (Table 2). As shown below, after the initial scan of the first half of the transaction database, the header table includes the identified frequent items: item B occurring eight (8) times, item A occurring seven (7) times, item D occurring seven (7) times, item F occurring six (6) times, and item G occurring six (6) times.

TABLE 2

Item    Count
B       8
A       7
D       7
F       6
G       6
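The header table of Table 2 can be reproduced from transactions 1-9 of Table 1 with a short sketch. The function name `build_header_table` is hypothetical, and the alphabetical tie-break between equally frequent items is an assumption made only so that the output order matches B, A, D, F, G; the text does not state how ties are resolved.

```python
from collections import Counter

# Transactions 1-9 of Table 1 (the first half of the database).
FIRST_HALF = [
    ["A", "B", "C", "D", "E"], ["F", "B", "D", "E", "G"],
    ["A", "B", "F", "G", "D"], ["B", "D", "A", "E", "G"],
    ["B", "F", "D", "G", "K"], ["A", "B", "F", "G", "D"],
    ["A", "R", "M", "K", "O"], ["B", "F", "G", "A", "D"],
    ["A", "B", "F", "M", "O"],
]

def build_header_table(transactions, min_support):
    """Return (item, count) pairs meeting min_support, most frequent first."""
    counts = Counter(item for t in transactions for item in set(t))
    frequent = [(item, n) for item, n in counts.items() if n >= min_support]
    # Assumed tie-break: alphabetical order among items with equal counts.
    return sorted(frequent, key=lambda pair: (-pair[1], pair[0]))

# Half of the database is scanned, so the threshold of 12 is halved to 6.
header = build_header_table(FIRST_HALF, 6)
```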

At 208, the logic 102 identifies frequent items which are used to build a probe structure. In an embodiment, the probe structure includes a probe tree and associated probe table, described below. For this example, items B, A, D, and F are identified as items to be used in building a probe tree. At 210, logic 102 creates a root node of the probe tree and begins building an empty probe table. At 212, the logic 102 begins building the probe tree from the root node using the first most frequent item in the transaction database from the identified frequent items B, A, D, and F. At 214, the logic 102 adds a child with its item identification (ID) to the probe tree. While building the probe tree, the logic 102 begins filling in an empty probe table (e.g. counts are set to zero, as shown below). As described above, the probe tree is built by evaluating the identified frequent items. At 216, the logic 102 determines if there are any remaining frequent items to evaluate. If there are remaining items, the flow returns to 212 and continues as described above. If there are no remaining items, the flow proceeds to 218.

FIG. 3 represents a probe tree data structure (or probe tree) 300 which includes the identified items resulting from steps 208-216. A more detailed description of the probe tree and probe table build follows. According to an embodiment, the logic 102 selects the first “M” identified frequent items to build the probe tree and associated probe table. As a result the probe tree and associated probe table include 2^(M) branches. In experiments, M=10 is a sound choice for many cases. As described below, the 2^(M) branches may be further grouped to obtain substantially equal groups or chunks for further processing.

For this example, the probe tree 300 is a representation of the identified frequent items (e.g. B, A, D, F (M=4)) and their transactional relationship to one another in accordance with the occurrence frequency and content. For the first most frequent item, B, the logic 102 creates node 302 which corresponds to the most frequent item B. The logic 102 takes the next most frequent item, A, and creates node 304 corresponding to frequent item A. The logic 102 also creates block 306 which shows that A is a “child” of B. Since A is only a child of B, the logic 102 takes the next most frequent item, D, creating node 308 corresponding to the next most frequent item. The logic 102 also creates blocks 310, 312, and 314, based on the determination that D is a child of B and A for each instance of B and A in the probe tree 300.

For the least frequent item of this example, F, the logic 102 creates node 316 which corresponds to the frequent item, F. The logic 102 also creates blocks 318, 320, 322, 324, 326, 328, and 330, illustrating that F is a child of B, A, and D for each instance of B, A, and D in the probe tree 300. The corresponding probe table (Table 3) is shown below for the identified frequent items B, A, D, and F. At this point, logic 102 sets the count for each item to zero, but is not so limited. Each row of the probe table (Table 3) represents a content-based transaction including one or more of the identified frequent items B, A, D, and F. The number “1” is used to represent an occurrence of a specific item in a transaction including the one or more frequent items, whereas the number “0” is used to represent one or more items that do not occur in an associated transaction.

TABLE 3

B    A    D    F    Count
1    1    1    1    0
1    1    1    0    0
1    1    0    1    0
1    1    0    0    0
1    0    1    1    0
1    0    1    0    0
. . .
0    1    0    0    0
. . .
0    0    0    0    0
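An empty probe table with 2^(M) rows can be generated mechanically from the M selected items. The sketch below is one possible representation, assuming a dictionary keyed by presence/absence tuples; the name `empty_probe_table` is hypothetical.

```python
from itertools import product

def empty_probe_table(probe_items):
    """Map every presence/absence pattern over probe_items to a zero count."""
    # product((1, 0), ...) yields (1,1,1,1) first and (0,0,0,0) last,
    # which is the row order used in Table 3.
    return {bits: 0 for bits in product((1, 0), repeat=len(probe_items))}

probe_table = empty_probe_table(["B", "A", "D", "F"])
rows = list(probe_table)
```

With M=4 the table has sixteen rows, one per branch of the probe tree.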

With continuing reference to FIGS. 2A-2B, after creating the probe tree and empty probe table, the flow proceeds to 218, where the logic 102 continues the first scan by evaluating the next transaction of the transaction database. For this embodiment, the logic 102 scans the second half of the transaction database (transactions 10-18), evaluating each respective transaction. At 220, based on the scanned transactions, the logic 102 provides counts in the probe table and updates the header table for the frequent items that meet the minimum support threshold. The updated probe table (Table 5) is shown below. At 222, if there are no further transactions in the transaction database, the flow proceeds to 224. Otherwise, if there are further transactions to process, the flow returns to 218. Based on the ordering of the identified frequent items above (B≧A≧D≧F≧G), the logic 102 updates the header table (Table 4 below).

TABLE 4

Item    Count
B       16
A       14
D       14
F       12
G       12

The updated probe table is shown below (Table 5), including the count field resulting from the scan of the second half of the transaction database. Each count represents the number of transactions which contain the respective identified frequent items after scanning the second half of the transaction database. Table 5 represents the content-based transactions after the first scan of the transaction database.

TABLE 5

B    A    D    F    Count
1    1    1    1    3
1    1    1    0    2
1    1    0    1    1
1    1    0    0    0
1    0    1    1    2
1    0    1    0    0
. . .
0    1    0    0    1
. . .
0    0    0    0    0
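The counts in Table 5 can be checked by mapping each of transactions 10-18 of Table 1 to its presence/absence pattern over B, A, D, and F. This is an illustrative sketch only; the helper names `probe_key` and `count_probe_rows` are hypothetical.

```python
from collections import Counter

PROBE_ITEMS = ["B", "A", "D", "F"]

# Transactions 10-18 of Table 1 (the second half of the database).
SECOND_HALF = [
    ["A", "B", "D", "E", "G"], ["B", "C", "D", "E", "F"],
    ["A", "B", "D", "E", "G"], ["A", "B", "F", "G", "D"],
    ["B", "F", "D", "G", "R"], ["A", "B", "D", "F", "G"],
    ["A", "R", "M", "K", "J"], ["B", "F", "G", "A", "D"],
    ["A", "B", "F", "M", "O"],
]

def probe_key(transaction, probe_items):
    """Return a 1/0 tuple marking which probe items the transaction contains."""
    present = set(transaction)
    return tuple(1 if item in present else 0 for item in probe_items)

def count_probe_rows(transactions, probe_items):
    """Count how many transactions fall into each probe-table row."""
    return Counter(probe_key(t, probe_items) for t in transactions)

probe_counts = count_probe_rows(SECOND_HALF, PROBE_ITEMS)
```

Rows that no transaction maps to, such as (1, 1, 0, 0), simply report a count of zero when looked up in the `Counter`.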

At 224, each transaction or probe tree branch, including one or more identified frequent items, may be assigned to one or more processors for further conditioning. The probe tree and probe table may be described as a preprocessing of the transaction data which operates to partition the transaction data for data mining. According to an embodiment, the probe table may be used to balance the processing load by identifying a transaction that consists of one or more items, wherein each transaction includes a unique itemset. More particularly, as described below, the logic 102 attempts to balance the workload, determined by the content of a particular transaction and the associated transaction count as identified by using the probe table and probe tree.

Based on a review of the probe table, there are three (3) transactions that include the identified frequent items B, A, D, and F. There are two (2) transactions that include the identified frequent items B, A, and D, but not F. There is one transaction that includes the identified frequent items B, A, and F, but not D. There are no transactions that include only the identified frequent items B and A. There are two (2) transactions that include the identified frequent items B, D, and F, but not A. There is one transaction that includes the identified frequent item A. The remaining transactions of the probe table may be determined in a similar manner. As described below, there are fewer locks associated with the nodes of the probe tree. Note that in the conventional single tree approach, each FP-tree node requires a lock. Since the probe tree does not require as many locks, a system using the probe tree functionality has increased scalability.

The updated probe table may be used to construct the frequent-pattern tree (FP-tree) by assigning transaction chunks to respective processors (such as 106 a-106 x), so that the individual processors each build a respective part of the FP-tree (a parallel build process). A grouping algorithm, described below, may be used to group branches of the probe tree into N groups, where N is the number of available processors or threads. A group of branches may then be assigned to a respective processor. The probe tree and probe table provide an important tool for balancing the computational load between multiple processors when mining data, including building of the FP-tree.

Under an embodiment, the mining application 101 assigns a group of transactions to a particular thread, so that each thread may process approximately the same number of transactions. The assignment of groups to the processors results in minimal conflicts between these processing threads. As described below, a heuristic algorithm may be used to group the branches of the probe table (e.g. represented by the counts of each row) into a number of approximately equal groups which correspond to the number of processors used to build the FP-tree and perform the data mining operations.

For example, assume that the computing device 100 includes four processors 106 a-106 d. According to an embodiment, one processor, such as processor 106 a, is assigned as the master processor, and remaining processors, such as 106 b-106 d, are assigned as slave processors. The master processor 106 a reads each transaction and assigns each transaction to a particular slave processor to build the FP-tree. At 226, a root node is created for the FP-tree. At 228, the master processor 106 a, using the mining application 101, scans the transaction database once again, starting with the first transaction.

At 230, the master processor 106 a distributes each transaction to a specific slave processor. At 232, each slave processor reads each transaction sent by the master processor 106 a. At 234, each slave processor inserts its pruned-item set into the full FP-tree. At 236, the mining application 101 determines whether each transaction of the full database has been read and distributed. If each transaction of the full database has been read and distributed, the flow proceeds to 240 and the mining application 101 begins mining the FP-tree for frequent patterns. If each transaction of the full database has not been read and distributed, the flow returns to 228. If there are no more transactions to process, the flow continues to 240, otherwise the flow returns to 232.

The distribution and parallel build of the FP-tree is shown in FIGS. 4A-4D. As described above, a heuristic algorithm may be used to group the transactions of the probe tree into substantially equal groupings for processing by the four processors (the three slave processors build respective parts of the FP-tree). For this example, the mining application 101 uses the heuristic algorithm to group the branches of the probe tree into three groups since there are three slave processors. According to an embodiment, the heuristic algorithm assumes that Vi is a vector which contains all chunks that belong to group i. Ci is the number of transactions of group i (the size of Vi). The goal is to distribute the transactions into (T−1) groups, making the total number of transactions in every group approximately equivalent.

The algorithm may be formalized as follows:

$V = \{V_1, V_2, \ldots, V_{T-1}\}$

$f = \sum_{i=1}^{T-1} \left( C_i - \frac{\sum_{j=1}^{T-1} C_j}{T-1} \right)^{2}$

The goal is to find V that minimizes f. The input is P={P1, P2, . . . , Pm}. The output is V={V₁, V₂, . . . , V_(T-1)}.

That is, for each P_(i):

Find j, Cj=min (Ck, k=1, 2, . . . , T−1)

Vj=Vj U chunk i

Cj=Cj+sizeof (chunk i)

Return V.
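The loop above (place each chunk into the currently smallest group) can be sketched as follows. This is a minimal illustration under stated assumptions: chunks are visited in probe-table order, ties go to the lowest-numbered group, and the names `group_chunks` and `imbalance` are hypothetical. The input uses the non-zero rows of the updated probe table (Table 5) and three groups, one per slave processor.

```python
def group_chunks(chunks, num_groups):
    """Greedily assign (name, size) chunks to the group with the fewest transactions."""
    groups = [[] for _ in range(num_groups)]   # V_1 .. V_(T-1)
    sizes = [0] * num_groups                   # C_1 .. C_(T-1)
    for name, size in chunks:
        j = sizes.index(min(sizes))            # first group with minimal C_j
        groups[j].append(name)
        sizes[j] += size
    return groups, sizes

def imbalance(sizes):
    """Objective f: sum of squared deviations of group sizes from their mean."""
    mean = sum(sizes) / len(sizes)
    return sum((c - mean) ** 2 for c in sizes)

# Non-zero rows of Table 5, written as B A D F bit strings with their counts.
chunks = [("1111", 3), ("1110", 2), ("1101", 1), ("1011", 2), ("0100", 1)]
groups, sizes = group_chunks(chunks, 3)
f = imbalance(sizes)
```

With this ordering every group ends up holding three transactions, so f is zero. Which chunks share a group depends on visiting order and tie-breaking, so the membership produced by this sketch need not coincide with any particular grouping described in the text even when the group sizes agree.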

Consequently, after using the heuristic algorithm, the first group includes the three (3) transactions that contain the items B, A, D, and F. The second group includes the two (2) transactions with items B, A, and D, but without item F, and the one (1) transaction with items B, A, and F, but without item D (total of three (3) transactions). The third group includes the two (2) transactions with items B, D, and F, but without item A, and the one (1) transaction with item A, but without items B, D, and F (total of three (3) transactions). There are no counts for other branches in the probe tree (see the probe table), so the logic 102 only assigns groups of the above five (5) branches based on the results of the heuristic algorithm. Based on the description above, the master processor 106 a assigns the first group to the first slave processor 106 b, the second group to the second slave processor 106 c, and the third group to the third slave processor 106 d for a parallel build of the FP-tree.

FIGS. 4A-4D illustrate the first steps in a parallel build of the FP-tree by the three slave processors 106 b-106 d. After the header table and root node of the FP-tree are built as described above, the master processor 106 a reads the first transaction (A, B, C, D, E), pruning it to (B, A, D) according to the header table content. The master processor 106 a, for this example, assigns (B, A, D) to slave processor 106 c according to the grouping based on the heuristic algorithm, probe tree, and probe table described above. As shown in FIG. 4A, after receiving the pruned items (B, A, D), slave processor 106 c sequentially creates nodes B, A, and D. The slave processor 106 c sets the corresponding node links (shown as dotted lines) for slave processor 106 c and also sets each node count to one (1).

The master processor 106 a reads the second transaction (F, B, D, E, G), pruning it to (B, D, F, G) according to the header table content. The master processor 106 a then assigns (B, D, F, G) to slave processor 106 d. Thereafter, as shown in FIG. 4B, slave processor 106 d inserts the pruned transaction (B, D, F, G) into the FP-tree by sequentially creating nodes D, F and G as shown in FIG. 4B. The slave processor 106 d sets the corresponding node links, sets the node count equal to one (1) for D, F, and G, while incrementing the node count for B by one (1). The master processor 106 a reads the third transaction (A, B, F, G, D), pruning it to (B, A, D, F, G) according to the header table content. The master processor 106 a then assigns the pruned transaction (B, A, D, F, G) to slave processor 106 b. Thereafter, slave processor 106 b inserts the pruned transaction (B, A, D, F, G) into the FP-tree by sequentially creating nodes F and G and setting the corresponding node links as shown in FIG. 4C. The slave processor 106 b sets the node count equal to one (1) for F and G, while incrementing the node count by one (1) for nodes B, A, and D in the common path.

The master processor 106 a reads the fourth transaction (B, D, A, E, G), pruning it to (B, A, D, G) according to the header table content. The master processor 106 a then assigns the pruned transaction (B, A, D, G) to slave processor 106 c. Thereafter, slave processor 106 c inserts the pruned transaction (B, A, D, G) into the FP-tree by creating node G and setting the corresponding node links as shown in FIG. 4D. The slave processor 106 c sets the node count equal to one (1) for node G, while incrementing the node count by one (1) for B, A, and D in the common path. The master processor 106 a continues scanning the transaction database, assigning each transaction to respective slave processors as described above. Each slave processor inserts the pruned transactions into the FP-tree until the full FP-tree is built including the representative links and node counts, as described above. In an embodiment, when creating the FP-tree, the multiple processors may not access a node with a lock while processing different transactions. Accordingly, the locks prevent multiple threads from attempting to create a duplicate node in the FP-tree. In an embodiment, the FP-tree is built from and saved to system memory 108. In another embodiment, the FP-tree is built from and saved to a distributed memory system.
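The pruning and insertion of the first four transactions can be traced with a simplified sketch. It is deliberately single-threaded and omits the header-table node links, the locks, and the master/slave distribution described above; the class `FPNode` and the functions `prune` and `insert` are hypothetical names introduced for illustration.

```python
class FPNode:
    """One FP-tree node: an item, its count, and its children keyed by item."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

# Item order taken from the header table (B, A, D, F, G).
HEADER_ORDER = ["B", "A", "D", "F", "G"]

def prune(transaction, order=HEADER_ORDER):
    """Keep only frequent items, sorted into header-table order."""
    present = set(transaction)
    return [item for item in order if item in present]

def insert(root, pruned):
    """Walk or extend the path for a pruned transaction, incrementing counts."""
    node = root
    for item in pruned:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, node)
            node.children[item] = child
        child.count += 1
        node = child

root = FPNode(None)
first_four = [
    ["A", "B", "C", "D", "E"], ["F", "B", "D", "E", "G"],
    ["A", "B", "F", "G", "D"], ["B", "D", "A", "E", "G"],
]
for transaction in first_four:
    insert(root, prune(transaction))

b = root.children["B"]
bad = b.children["A"].children["D"]   # node D on the path B -> A -> D
```

After the four insertions, node B has been visited by all four transactions and the shared path B, A, D by three of them, with separate G nodes under D and under F.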

After the FP-tree is built, the FP-tree may be mined for specific patterns with respect to the identified frequent items. The FP-tree may be mined using various data mining algorithms. For example, conditional (prefix) based mining operations may be performed using the FP-tree. The mining of the FP-tree according to an embodiment may be summarized as follows. Each processor may be used to independently mine a frequent item of the FP-tree. For each frequent item, the mining process constructs a “conditional pattern base” (e.g. a “sub-database” which consists of the set of prefix paths in the FP-tree co-occurring with the suffix pattern).

After constructing the conditional FP-tree, the conditional FP-tree is recursively mined according to the identified frequent items. As described above, content-based partitioning logic 102 is used to partition data of a database or data store into several independent parts for subsequent distribution to the one or more processors. The content-based partitioning logic 102 is used to partition the database equally so that the load is balanced when building and mining the FP-tree (e.g. each processor gets an approximately equal number of transactions to process), but is not so limited.

Another example illustrates the use of the mining application 101 including the content-based partitioning logic 102. Assume that a database contains eight (8) items (A, B, C, D, E, F, G, H). For this example, the database includes a total of eighty (80) transactions. After examining a first part of the database, such as the first half for example, it is determined that items A, B, and C are the first three (3) most frequent items in the first part of the database. Next, the transactions may be partitioned according to the content of the first M most frequent items. As described above, M may be heuristically determined, based in part, on the number of available processors. For the present example, M is set equal to three (M=3), since there are 3 most frequent items. Thus, the transactions may be partitioned into 2^(M) or eight (8) content-based chunks according to items A, B, and C. A content-based chunk is equal to a sub-branch in an FP-tree. Table 6 below provides an example of the distribution of transactions (e.g. 101 means the associated chunk includes transactions having items A and C, but not B).

TABLE 6

ABC    Transaction count
111    14
110    12
101    10
100    6
011    4
010    5
001    5
000    24

Assuming two available slave threads, the content-based partition logic 102 groups the eight chunks into two groups, one for each slave thread. As described above, a heuristic search algorithm may be used to group the eight chunks into two groups, wherein each group contains about the same number of transactions. The two resulting groups may include the first group of chunks {111, 110, 101, 011} and the second group of chunks {100, 010, 001, 000}, wherein each group contains forty (40) transactions. FIG. 5 depicts the partitioned sub-branches and grouping result. The two sub-branches circled with dashed lines represent a first group of chunks assigned to the first thread. The two other sub-branches circled with solid lines represent the second group of chunks assigned to the second thread.
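The stated grouping can be checked against the chunk counts of Table 6 with a few lines of arithmetic. The sketch only verifies the two groups given above and evaluates the balance objective f for them; it does not attempt to reproduce the heuristic search itself, and the variable names are hypothetical.

```python
# Chunk counts from Table 6, keyed by the A B C presence pattern.
TABLE_6 = {
    "111": 14, "110": 12, "101": 10, "100": 6,
    "011": 4, "010": 5, "001": 5, "000": 24,
}
GROUP_1 = ["111", "110", "101", "011"]
GROUP_2 = ["100", "010", "001", "000"]

def group_size(group):
    """Total number of transactions in the chunks of one group."""
    return sum(TABLE_6[chunk] for chunk in group)

sizes = [group_size(GROUP_1), group_size(GROUP_2)]
mean = sum(sizes) / len(sizes)
# Objective f: squared deviation of each group size from the mean.
f = sum((c - mean) ** 2 for c in sizes)
```

Both groups sum to forty of the eighty transactions, so the objective f evaluates to zero for this grouping.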

The equal numbers of transactions processed by each thread results in an efficient balance of the load when building the FP-tree. Additionally, after the partitioning using the probe tree, only eight (8) (2³) locks are required for the possible shared nodes (e.g. only consisting of A, B, and C in the FP-tree) in the parallel tree-building process, compared to 256 (2⁸) locks for a conventional build. Thus, the pre-partitioning results in a large reduction in the number of locks (e.g. power(2, items_num)−power(2, N)). Furthermore, the content-based partitioning may not require a duplication of nodes as in the conventional multiple tree approach. Accordingly, the content-based partitioning and grouping results in a good load balance among threads, while providing high scalability.

Table 7 below provides experimental results based on a comparison of the content-based tree partition approach and the multiple tree approach. The test platform was a 16-processor shared-memory system. The benchmark database “accidents” may be downloaded directly from the link “http://fimi.cs.helsinki.fi/data/”, and the databases “bwebdocs” and “swebdocs” were cut from the database “webdocs.”

TABLE 7

Approach        Database    1P     2P     4P     8P     16P
Multiple tree   swebdocs    1102   613    359    197    125
Multiple tree   bwebdocs    278    155    92.7   52.8   42.9
Multiple tree   accidents   23.2   14.1   8.44   6.62   4.95
Tree partition  swebdocs    1031   539    330    187    120
Tree partition  bwebdocs    278    155    92.5   53.7   40.2
Tree partition  accidents   22.3   14.3   8.78   6.20   5.29

Table 7 shows the running time for the whole application, including the first scan, the tree construction procedure and the mining stages. Table 8 below shows the running times of the mining stage for each approach on all three benchmark databases.

TABLE 8

Approach        Database    1P     2P     4P     8P     16P
Multiple tree   swebdocs    1087   603    350    187    113
Multiple tree   bwebdocs    258    134    73.5   32.7   19.9
Multiple tree   accidents   20.3   11.8   6.44   4.35   2.95
Tree partition  swebdocs    1018   527    320    176    110
Tree partition  bwebdocs    250    129    70.6   31.9   17.8
Tree partition  accidents   18.0   10.5   5.63   2.66   1.44

As shown in Tables 7 and 8 above, the tree partition approach has performance comparable to the older multiple tree approach in total running time, and is a little better in the mining stage alone. The results are reasonable: in the first and second stages the tree partition approach does additional partitioning of each transaction based on its content. However, the tree partition approach should be a little faster in the mining stage, since there are no duplicated tree nodes.
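
One way to read the scaling behavior out of Table 8 is to convert the running times into speedups relative to the single-processor run. The sketch below does this for the mining-stage times. The data are copied from Table 8; the names `MINING_TIMES`, `speedups`, and `speedup_16p` are hypothetical.

```python
# Mining-stage running times from Table 8, for 1, 2, 4, 8 and 16 processors.
MINING_TIMES = {
    "multiple tree": {
        "swebdocs": [1087, 603, 350, 187, 113],
        "bwebdocs": [258, 134, 73.5, 32.7, 19.9],
        "accidents": [20.3, 11.8, 6.44, 4.35, 2.95],
    },
    "tree partition": {
        "swebdocs": [1018, 527, 320, 176, 110],
        "bwebdocs": [250, 129, 70.6, 31.9, 17.8],
        "accidents": [18.0, 10.5, 5.63, 2.66, 1.44],
    },
}


def speedups(times):
    """Speedup of each run relative to the single-processor time."""
    base = times[0]
    return [base / t for t in times]


# Speedup at 16 processors for every approach and database.
speedup_16p = {
    approach: {db: speedups(times)[-1] for db, times in dbs.items()}
    for approach, dbs in MINING_TIMES.items()
}
```

For the “accidents” database, for instance, the 16-processor mining-stage speedup is 18.0/1.44 = 12.5 for the tree partition approach and 20.3/2.95, or about 6.9, for the multiple tree approach.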

Aspects of the methods and systems described herein may be implemented as functionality programmed into any of a variety of circuitry, including programmable logic devices (“PLDs”), such as field programmable gate arrays (“FPGAs”), programmable array logic (“PAL”) devices, electrically programmable logic and memory devices and standard cell-based devices, as well as application specific integrated circuits. Implementations may also include microcontrollers with memory (such as EEPROM), embedded microprocessors, firmware, software, etc. Furthermore, aspects may be embodied in microprocessors having software-based circuit emulation, discrete logic (sequential and combinatorial), custom devices, fuzzy (neural) logic, quantum devices, and hybrids of any of the above device types. Of course the underlying device technologies may be provided in a variety of component types, e.g., metal-oxide semiconductor field-effect transistor (“MOSFET”) technologies like complementary metal-oxide semiconductor (“CMOS”), bipolar technologies like emitter-coupled logic (“ECL”), polymer technologies (e.g., silicon-conjugated polymer and metal-conjugated polymer-metal structures), mixed analog and digital, etc.

The term “processor” as generally used herein refers to any logic processing unit, such as one or more central processing units (“CPU”), digital signal processors (“DSP”), application-specific integrated circuits (“ASIC”), etc. While the term “component” is generally used herein, it is understood that “component” includes circuitry, components, modules, and/or any combination of circuitry, components, and/or modules as the terms are known in the art.

The various components and/or functions disclosed herein may be described using any number of combinations of hardware, firmware, and/or as data and/or instructions embodied in various machine-readable or computer-readable media, in terms of their behavioral, register transfer, logic component, and/or other characteristics. Computer-readable media in which such formatted data and/or instructions may be embodied include, but are not limited to, non-volatile storage media in various forms (e.g., optical, magnetic or semiconductor storage media) and carrier waves that may be used to transfer such formatted data and/or instructions through wireless, optical, or wired signaling media or any combination thereof. Examples of transfers of such formatted data and/or instructions by carrier waves include, but are not limited to, transfers (uploads, downloads, e-mail, etc.) over the Internet and/or other computer networks via one or more data transfer protocols.

Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise,” “comprising,” and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is to say, in a sense of “including, but not limited to.” Words using the singular or plural number also include the plural or singular number respectively. Additionally, the words “herein,” “hereunder,” “above,” “below,” and words of similar import refer to this application as a whole and not to any particular portions of this application. When the word “or” is used in reference to a list of two or more items, that word covers all of the following interpretations of the word: any of the items in the list; all of the items in the list; and any combination of the items in the list.

The above description of illustrated embodiments is not intended to be exhaustive or limited by the disclosure. While specific embodiments of, and examples for, the systems and methods are described herein for illustrative purposes, various equivalent modifications are possible, as those skilled in the relevant art will recognize. The teachings provided herein may be applied to other systems and methods, and not only for the systems and methods described above. The elements and acts of the various embodiments described above may be combined to provide further embodiments. These and other changes may be made to methods and systems in light of the above detailed description.

In general, in the following claims, the terms used should not be construed to be limited to the specific embodiments disclosed in the specification and the claims, but should be construed to include all systems and methods that operate under the claims. Accordingly, the method and systems are not limited by the disclosure, but instead the scope is to be determined entirely by the claims. While certain aspects are presented below in certain claim forms, the inventors contemplate the various aspects in any number of claim forms. Accordingly, the inventors reserve the right to add additional claims after filing the application to pursue such additional claim forms for other aspects as well.

What is claimed is:
 1. A method for mining data of a database, comprising: identifying transaction items of the database and determining an occurrence frequency for each item, wherein determining the occurrence frequency includes: scanning a first portion of the database; identifying transaction items of the first portion of the database with an occurrence frequency at least equal to a threshold value; scanning a second portion of the database; and identifying transaction items of the second portion of the database with an occurrence frequency at least equal to the threshold value; locking the identified transaction items to prevent other data mining processes from selecting the identified transaction items; building a probe structure based on the identified frequent transaction items with an occurrence frequency at least equal to twice the threshold value; building a plurality of disjoint branches for the probe structure, wherein each branch of the probe structure includes a number of identified transaction items selected based on content of the transaction items and the occurrence frequency of the transaction items, at least two branches includes a common transaction item, and each of the plurality of disjoint branches are capable of being executed independently from the other plurality of disjoint branches; building a frequent pattern tree (FP-tree) from the branches of the probe structure; grouping the branches of the FP-tree into a plurality of groups, the grouping based on the content of the transaction items of each branch, wherein the number of transactions in each of the plurality of groups is substantially equal; and assigning, via a master processor, each group of branches of the FP-tree to one of a plurality of slave processors, the plurality of slave processors to execute the transaction items identified by the respective branch in parallel with each other, wherein the number of transaction items to be executed by each of the plurality of slave processors is substantially equal.
 2. The method of claim 1, further comprising building the probe structure to include a probe tree and probe table, and using the probe tree and probe table to build the FP-tree for mining the FP-tree to determine frequent data patterns.
 3. The method of claim 1, further comprising partitioning the database according to content of the identified transaction items to obtain the probe structure, wherein the probe structure includes combinations of the identified transaction items and the number of occurrences of one or more content-based transactions.
 4. A computer-readable non-transitory storage medium having stored thereon instructions, which when executed in a system operate to manage data of a database by: identifying transaction items of the database and determining an occurrence frequency for each item, wherein determining the occurrence frequency includes: scanning a first portion of the database; identifying transaction items of the first portion of the database with an occurrence frequency at least equal to a threshold value; scanning a second portion of the database; and identifying transaction items of the second portion of the database with an occurrence frequency at least equal to the threshold value; locking the identified transaction items to prevent other data mining processes from selecting the identified transaction items; building a probe structure based on the identified frequent transaction items with an occurrence frequency at least equal to twice the threshold value; building a plurality of disjoint branches for the probe structure, wherein each branch of the probe structure includes a number of identified transaction items selected based on content of the transaction items and the occurrence frequency of the transaction items, at least two branches includes a common transaction item, and each of the plurality of disjoint branches are capable of being executed independently from the other plurality of disjoint branches; building a frequent pattern tree (FP-tree) from the branches of the probe structure; grouping the branches of the FP-tree into a plurality of groups, the grouping based on the content of the transaction items of each branch, wherein the number of transactions in each of the plurality of groups is substantially equal; and assigning, via a master processor, each group of branches of the FP-tree to one of a plurality of slave processors, the plurality of slave processors to execute the transaction items identified by the respective branch in parallel with each other, wherein the number of transaction items to be executed by each of the plurality of slave processors is substantially equal.
 5. The computer-readable non-transitory storage medium of claim 4, wherein the instructions, which when executed in a system operate to manage data of a database further by building the probe structure to include a probe tree and probe table, and using the probe tree and probe table to build the FP-tree for mining the FP-tree to determine frequent data patterns.
 6. A system comprising: a master processor; a plurality of slave processors; a database; and software to identify transaction items of the database and determine an occurrence frequency for each item, wherein determining the occurrence frequency includes: scanning a first portion of the database; identifying transaction items of the first portion of the database with an occurrence frequency at least equal to a threshold value; scanning a second portion of the database; and identifying transaction items of the second portion of the database with an occurrence frequency at least equal to the threshold value; lock the identified transaction items to prevent other data mining processes from selecting the identified transaction items; build a probe structure based on the identified frequent transaction items with an occurrence frequency at least equal to twice the threshold value; build a plurality of disjoint branches for the probe structure, wherein each branch of the probe structure includes a number of identified transaction items selected based on content of the transaction items and the occurrence frequency of the transaction items, at least two branches includes a common transaction item, and each of the plurality of disjoint branches are capable of being executed independently from the other plurality of disjoint branches; build a frequent pattern tree (FP-tree) from the branches of the probe structure; group the branches of the FP-tree into a plurality of groups, the grouping based on the content of the transaction items of each branch, wherein the number of transactions in each of the plurality of groups is substantially equal; and assign, via a master processor, each group of branches of the FP-tree to one of a plurality of slave processors, the plurality of slave processors to execute the transaction items identified by the respective branch in parallel with each other, wherein the number of transaction items to be executed by each of the plurality of slave processors is substantially equal.
 7. The system of claim 6, the software to further build the probe structure to include a probe tree and probe table, and use the probe tree and probe table to build the FP-tree for mining the FP-tree to determine frequent data patterns.
 8. The system of claim 6, the software to further partition the database according to content of the identified transaction items to obtain the probe structure, wherein the probe structure includes combinations of the identified transaction items and the number of occurrences of one or more content-based transactions.